Is the Use of the Difference Likelihood Ratio Chi-square
Statistic for Comparing Nested IRT Models Justifiable? 



Abstract 

The main purposes of this research are to investigate, by means
of simulation, (a) whether the difference likelihood ratio chi-square
statistic, $G^2_{dif}$, for comparing IRT models is asymptotically distributed
as a chi-square distribution and (b) the accuracy rate of applying $G^2_{dif}$
in the selection of nested IRT models. The results of this study
demonstrate that the usual practice of treating $G^2_{dif}$ as distributed
as a central chi-square distribution is not sound. For short test lengths,
the proportion of times the correct model is selected can be very
low. It appears that $G^2_{dif}$ is more likely to be distributed as a
noncentral chi-square distribution. Discussion concerning the
proportion of times the right model is selected by the difference statistic,
as well as its relative merits in comparison to the AIC and the $m_k$
indices, is also included in this study.

KEY WORDS: Difference Likelihood Ratio Chi-square Statistic, Difference Chi- 
square Statistic, Item Response Theory (IRT), BILOG, Likelihood Ratio Chi-square 
Statistic, Pearson Chi-square Statistic. 
Model Selection in IRT Models

I. Introduction 

The issue of model-data fit is of major concern when item response theory
(IRT) models are used for practical testing. More specifically, the concern is whether 
or not test items fit the model assumed by practitioners (see, e.g., McKinley & Mills, 
1985; Reise, 1990; Rogers, 1987; Smith, 1991; Yen, 1981). It is of course desirable 
to have most items fit the assumed model. If not, especially in the presence of 
numerous mis-fit items, the issue of model choice is critical. Currently, identifying 
the most parsimonious test model that retains the integrity of the observed data is an 
important motivation behind item-fit studies (Yen, 1981). However, the issue of 
selecting an appropriate IRT model has received less attention than the study of item 
fit. Poor model choice can lead to inaccurate conclusions of item dregging, as well as 
inappropriate assessment of differential item functioning. 

To what extent can a certain IRT model be used to model a given set of 
examinees' responses to a test? A common method is to use the likelihood ratio chi- 
square goodness-of-fit statistic to measure the degree of data-model fit, as is also the 
case in latent class analysis and structural equation modeling. It is generally
assumed that the likelihood ratio chi-square statistic is asymptotically distributed as a
chi-square distribution with appropriate degrees of freedom as specified by the model.
However, this assumption is not always tenable. In view of the
numerous possible combinations of responses, the frequency counts of
examinees in many response patterns will be very sparse, thereby violating the
requirement of the chi-square statistic that most of the expected frequencies be at
least equal to five (see, e.g., Bock & Aitkin, 1981; Gitomer & Yamamoto, 1991;
Reiser & VandenBerg, 1994).

Furthermore, a distinct characteristic in the IRT framework is that item 
parameter estimates derived from the joint maximum likelihood estimation may not 
be consistent as sample size and the number of items increase. This is because the 
abilities of the examinees are unknown and must be estimated along with item 
parameters (refer to Baker, 1992 for detailed discussion). Item parameters estimated 
by marginal maximum likelihood method (Bock & Aitkin, 1981) do not depend on 
the direct estimation of examinees’ abilities, but rather on their ability distribution 
(Baker, 1992). Thus, the estimate of the likelihood ratio chi-square statistic could be
affected by special characteristics of the item parameter estimates. 

Traditionally, the difference or component chi-square ($G^2_{dif}$) is used for the
comparison of the relative fit of various IRT models with different parameter
restrictions (e.g., Bock & Aitkin, 1981; Fischer & Parzer, 1991; Gitomer &
Yamamoto, 1991), as well as for the assessment of differential item functioning
(Camilli & Shepard, 1994; Thissen, Steinberg & Wainer, 1993). This statistic is
based on the ratio of the maximized likelihood of the compact
model to that of the general or subsuming model. Numerically, it is
computed as the change in likelihood ratio chi-square statistics between a pair of
hierarchically related models. Based on the additivity property of the likelihood ratio
chi-square, $G^2_{dif}$ is usually presumed to be asymptotically distributed as a chi-square
distribution, with its degrees of freedom equal to the difference of the degrees of
freedom between the two corresponding models. Yet, as discussed above, the
likelihood ratio chi-square statistic corresponding to either the subsuming or the
nested model may not be chi-square distributed in the first place. Hence, the overall
question raised in the present study is whether the use of the difference likelihood
ratio chi-square statistic for comparing hierarchically nested IRT models (i.e., models
in which one is a constrained form of the other) is valid.

More specifically, the main purposes of this research are to investigate, by
means of simulation, (a) whether the difference likelihood ratio chi-square statistic for
comparing IRT models is asymptotically distributed as a chi-square distribution and
(b) the accuracy rate of applying $G^2_{dif}$ in the selection of nested IRT models. Discussion
concerning the proportion of times the right model is selected by the difference statistic,
as well as its relative merits in comparison to the AIC (Bozdogan, 1987) and the $m_k$
indices (McDonald & Mok, 1995), will also be included in this study.

Presented are a brief review of background theory in section two, a description 
of the methodology in section three, the results and discussion in section four, and the 
conclusion in section five. 

II. The Likelihood Ratio Chi-square for Model Comparison 



A. Likelihood Function of IRT Models 

Under the three-parameter logistic (three-PL) model (see Baker, 1992; Hambleton &
Swaminathan, 1985; Mislevy & Bock, 1990), the probability, $P_{ij}$, of a correct response
to the $i$th item for the $j$th examinee with ability $\theta_j$ is given by:

$$P_{ij}(\theta_j) = c_i + (1 - c_i)\,\frac{e^{D a_i (\theta_j - b_i)}}{1 + e^{D a_i (\theta_j - b_i)}} \tag{1}$$

where $a_i$ is the item discrimination, $b_i$ is the item difficulty, $c_i$ is the lower asymptote
parameter (also known as the guessing parameter), and $D$ (usually equal to 1.702) is a
scaling factor. A two-PL model is attained if the guessing parameter $c_i$ is constrained
to zero for all items in (1) above. A one-PL model is a restricted form of the two-PL
model obtained by further constraining the item discrimination index $a_i$ to be identical or equal
to one for all items.
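As a minimal sketch of Equation (1) and its nested special cases, the response probability can be coded as below. The function names `p_3pl`, `p_2pl`, and `p_1pl` are illustrative choices, not part of BILOG or of the original study, and the parameter values in the usage line are hypothetical.

```python
import math

def p_3pl(theta, a, b, c=0.0, D=1.702):
    """Probability of a correct response under the three-PL model, Equation (1)."""
    z = D * a * (theta - b)
    # e^z / (1 + e^z) is written as 1 / (1 + e^-z), which is algebraically identical.
    return c + (1.0 - c) / (1.0 + math.exp(-z))

def p_2pl(theta, a, b, D=1.702):
    """Two-PL model: the guessing parameter is constrained to zero."""
    return p_3pl(theta, a, b, c=0.0, D=D)

def p_1pl(theta, b, D=1.702):
    """One-PL model: additionally, the discrimination is fixed at one."""
    return p_3pl(theta, a=1.0, b=b, c=0.0, D=D)

# Hypothetical item: a = 0.8, b = -0.7, c = 0.13, examinee at theta = 0.
print(p_3pl(0.0, a=0.8, b=-0.7, c=0.13))
```

Writing the one-PL and two-PL models as calls to the three-PL function makes the nesting explicit: each simpler model is the general model with some parameters held fixed.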

Assuming that the local independence assumption holds, if an examinee
with ability $\theta$ responds to a set of $n$ items with the response pattern $u$, then the
probability of obtaining the response pattern $u$ given $\theta$ and the item parameter vector
$\xi = (a, b, c)$ can be computed by:

$$P(u \mid \theta, \xi) = \prod_{i=1}^{n} P_i^{u_i} Q_i^{1-u_i} \tag{2}$$

where $Q_i = 1 - P_i$. If $\theta$ is randomly sampled from an ability distribution with density $g(\theta)$,
the unconditional probability is given by (see Baker, 1992; Mislevy & Bock, 1990):

$$P(u, \theta \mid \xi) = \prod_{i=1}^{n} P_i^{u_i} Q_i^{1-u_i}\, g(\theta) \tag{3}$$

Then, the marginal probability of obtaining the response pattern $u$ is obtained by
integrating the ability parameter $\theta$ out of (3), thereby giving:

$$P(u \mid \xi) = \int P(u \mid \theta, \xi)\, g(\theta)\, d\theta = \pi_u \tag{4}$$

As a result, the item parameters can be estimated without the estimation of the $\theta$
parameters. This marginal probability of obtaining the response pattern $u$, $P(u \mid \xi)$,
hereafter denoted by $\pi_u$, can be approximated to any desired degree of accuracy by
Gauss-Hermite quadrature via computing the sum:

$$\pi_u \approx \sum_{k} P(x = u \mid X_k)\, A(X_k) \tag{5}$$

Here $X_k$ represents a tabled quadrature point (node), and $A(X_k)$ is the corresponding
weight, which is related to the height of the density $g(\theta)$ in the neighborhood of the
node $X_k$ (see Stroud & Sechrest, 1966 for details).
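The quadrature sum can be sketched as follows, assuming a standard normal ability density and using NumPy's probabilists' Gauss-Hermite rule (`hermegauss`), whose weights are rescaled so that they sum to one. The function names, the number of nodes, and the five item parameter values are illustrative assumptions; BILOG's own quadrature scheme may differ in detail.

```python
import itertools
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def p_3pl(theta, a, b, c, D=1.702):
    """Three-PL response probability; works elementwise on arrays."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def marginal_pattern_prob(u, a, b, c, n_nodes=21):
    """Approximate pi_u, the marginal probability of response pattern u,
    by summing P(u | X_k) * A(X_k) over quadrature nodes X_k, with g = N(0, 1)."""
    nodes, weights = hermegauss(n_nodes)            # weight function exp(-x^2 / 2)
    weights = weights / math.sqrt(2.0 * math.pi)    # rescale so the weights sum to 1
    u = np.asarray(u)[:, None]                      # shape: items x 1
    p = p_3pl(nodes[None, :], a[:, None], b[:, None], c[:, None])  # items x nodes
    cond = np.prod(np.where(u == 1, p, 1.0 - p), axis=0)  # P(u | X_k) under local independence
    return float(np.sum(cond * weights))

# Hypothetical parameters for a 5-item test (not the values used in the study).
a = np.array([0.60, 0.70, 0.80, 0.90, 1.03])
b = np.array([-1.66, -1.20, -0.70, -0.20, 0.29])
c = np.array([0.06, 0.10, 0.13, 0.16, 0.20])

patterns = list(itertools.product([0, 1], repeat=5))
total = sum(marginal_pattern_prob(u, a, b, c) for u in patterns)
print(total)  # the 32 marginal pattern probabilities should sum to about 1
```

Because the conditional pattern probabilities sum to one at every node and the rescaled weights sum to one, the 32 marginal probabilities add up to one regardless of the number of nodes; accuracy of the individual values is what improves with more nodes.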

Now let the subscript $u$ represent a specific response pattern, $r_u$ denote the
number of examinees obtaining that specific response pattern, and $s$ represent the
number of distinct response patterns observed. In general, there are $2^n$ possible
response patterns for $n$ binary items; hence $s \le 2^n$, ignoring those patterns with $r_u = 0$.
Thus,

$$\sum_{u=1}^{s} r_u = N \tag{6}$$

where $N$ is the sample size. The likelihood function is then defined as the joint
probability of all examinees' response patterns and is given by

$$L \propto \prod_{u=1}^{s} (\pi_u)^{r_u} \tag{7}$$

Taking the logarithm of Equation 7 results in

$$\ln L = k + \sum_{u=1}^{s} r_u \ln(\pi_u), \tag{8}$$

where $k$ is a constant which does not influence the estimation of the item parameters.
The item parameter estimates are obtained by maximizing the log likelihood function
presented in Equation 8. More specifically, they are obtained by differentiating $\ln L$
with respect to the item parameters $a$, $b$, and $c$, and solving the subsequent likelihood
equations simultaneously. If the underlying shape of the ability distribution is
correctly specified, the marginal maximum likelihood estimator can be consistent as
the number of test items and the sample size increase (refer to Seong, 1990).

When data from a large sample of examinees are available, the model-data fit
may be tested either for the whole test or item by item. The likelihood ratio
goodness-of-fit chi-square statistic for testing the assumed model against a general multinomial
alternative is given by (Bock & Aitkin, 1981)

$$G^2 = 2 \sum_{u=1}^{s} r_u \ln\!\left(\frac{r_u}{N \pi_u}\right) \tag{9}$$

where $G^2$ is equal to $-2 \ln L$ as presented in Equation 8, apart from a term that
depends only on the observed frequencies. This statistic may be
asymptotically distributed as a chi-square distribution with $s - mn - 1$ degrees of
freedom (where $m$ is the number of parameters per item in the model).
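A minimal sketch of this computation is given below, assuming the usual likelihood ratio form $G^2 = 2 \sum_u r_u \ln[r_u / (N \pi_u)]$ over the observed response patterns. The function names and the toy frequencies are illustrative and are not taken from the study.

```python
import math

def g2_statistic(counts, probs):
    """Likelihood ratio goodness-of-fit statistic over observed patterns.

    counts: observed frequencies r_u for each response pattern
    probs:  model-implied marginal probabilities pi_u for the same patterns
    """
    n_total = sum(counts)
    return 2.0 * sum(r * math.log(r / (n_total * p))
                     for r, p in zip(counts, probs) if r > 0)

def g2_df(n_patterns, n_items, params_per_item):
    """Degrees of freedom s - m*n - 1."""
    return n_patterns - params_per_item * n_items - 1

# Toy example with four patterns (hypothetical frequencies and probabilities).
counts = [10, 20, 30, 40]
probs = [0.25, 0.25, 0.25, 0.25]
print(g2_statistic(counts, probs))

# A 5-item test with all 32 patterns observed, fitted with the two-PL model.
print(g2_df(n_patterns=32, n_items=5, params_per_item=2))
```

Patterns with zero observed frequency contribute nothing to the sum, which is why they are skipped rather than passed to the logarithm.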

However, if the number of all possible response patterns ($2^n$) is large relative to
the sample size $N$, then most of the expected frequencies of the response patterns
($N\pi_u$) will be less than 5. This setting is quite common in practical testing situations.
Bock and Aitkin (1981) suggested that the frequencies of response patterns with small
expectations should be pooled until all expected frequencies equal or exceed 5. After
pooling, the likelihood ratio chi-square statistic with $s_p - mn - 1$ degrees of freedom
(where $s_p$ is the number of response patterns after pooling) then provides a conservative test
of data-model fit. Unfortunately, the likelihood function in Equation 8 has not actually
been maximized in the pooled data (Bock & Aitkin, 1981). In addition, the way of
pooling data is subjective, and no IRT computer software at present can provide this
kind of test when pooling data is necessary. Consequently, the fit of the model has
seldom been assessed for a whole test in studies reported in the literature, except for
the case of a short test with a large number of examinees, such as the empirical
dataset (5 test items by 1000 examinees) of the Law School Admission Test (LSAT)
that was used in, among others, Bock and Lieberman (1970).



C. Model Selection in IRT Models 

Regarding the choice of an appropriate IRT model, although the true model is
never known, the usual practice is to determine whether some models fit an observed dataset
better than others. One way to do this is to compute the likelihood ratio
goodness-of-fit chi-square statistics of the relevant models together with their associated degrees
of freedom. But as explained above, the likelihood ratio chi-square statistics, in most
cases, may not be appropriate for assessing data-model fit in IRT modeling. An often
used alternative approach is to compare the relative fit of two IRT models. Let
$G^2_g$ be the likelihood ratio chi-square statistic of a more general IRT model, and let $G^2_c$
be the corresponding statistic of a more constrained or nested version of the former
model. The statistic used to assess the improvement in fit of the augmented model
over the compact model is the difference chi-square, which can be expressed as:

$$G^2_{dif} = G^2_c - G^2_g = -2 \ln L_c + 2 \ln L_g \tag{10}$$

where $\ln L_c$ and $\ln L_g$ can be derived from Equation 8. (Note that even for the same
dataset, different IRT computer programs may handle either the constant $k$ or the
metric of the item parameters differently, or they may employ different
estimation algorithms. Thus the value of the $-2$ log likelihood reported in the
computer output may differ from program to program.)
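The conventional use of the difference statistic, which is the practice this study puts to the test, can be sketched as follows. The function names and the $-2 \ln L$ values are hypothetical; the critical value 11.07 is the .95 quantile of a central chi-square distribution with 5 degrees of freedom and is passed in explicitly rather than computed.

```python
def g2_difference(neg2lnl_compact, neg2lnl_general, params_compact, params_general):
    """Return (G2_dif, df_dif) for a pair of hierarchically nested models.

    neg2lnl_*: the -2 ln L values reported for the compact and general models
    params_*:  the number of item parameters estimated by each model
    """
    g2_dif = neg2lnl_compact - neg2lnl_general
    df_dif = params_general - params_compact
    return g2_dif, df_dif

def prefers_general(g2_dif, critical_value):
    """Conventional decision rule: choose the general model if G2_dif exceeds
    the central chi-square critical value for df_dif."""
    return g2_dif > critical_value

# Hypothetical 5-item example: one-PL (5 parameters) versus two-PL (10 parameters).
g2_dif, df_dif = g2_difference(5000.0, 4980.0, params_compact=5, params_general=10)
print(g2_dif, df_dif, prefers_general(g2_dif, critical_value=11.07))
```

Both $-2 \ln L$ values should come from the same program and the same dataset, since the constant and the parameter metric may differ across programs.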

It is important to notice that several assumptions have to be satisfied in order
for the $G^2_{dif}$ statistic to be approximately distributed as a chi-square distribution with
its degrees of freedom equal to the degrees of freedom of the nested model minus those
of the subsuming model. They are, among others, that (a) the two models are
hierarchically related, (b) the fundamental IRT assumptions have been met in the
estimation of both models, and (c) the more general of the two models provides a
proper specification for the data (see Holt & Macready, 1988).

Other properties of the difference chi-square statistic have been pointed out by
Steiger, Shapiro & Browne (1985) in their seminal paper. Specifically, they
demonstrated that the asymptotic intercorrelations among the chi-square statistics
calculated for hierarchically related models on the same dataset can be quite high.
However, the chi-square statistics and the sequential chi-square
statistics computed from pairs of nested models should be asymptotically
independent of each other. Also, the difference chi-squares
should be independent of one another.
Besides the difference chi-square, alternative model selection procedures are
also pursued in this paper for comparison. A brief description is included here for
handy reference. The first is Akaike's information criterion, or AIC,
which is defined as (see Bozdogan, 1987)

$$\mathrm{AIC} = -2 \ln L + 2m, \tag{11}$$

with $m$ denoting the number of parameters estimated by an IRT model. The model of
choice is usually regarded as the one that yields the lowest AIC value. By
virtue of its definition, AIC penalizes the more complicated models in favor of the
more parsimonious models. This index is included in the present study because its
performance in terms of selecting an IRT model is not very well known. Another
alternative is the $m_k$ index, which was originally suggested in the context of structural
equation modeling (McDonald, 1988), but later introduced by McDonald and Mok
(1995) as a measure of goodness-of-fit for IRT models. It is defined as follows:

$$m_k = e^{-\frac{1}{2} d_k}, \tag{12}$$

where $d_k$ is, in turn, defined as $(G^2 - df)/N$. Here $d_k$ is actually a measure of the
non-centrality parameter. The index $m_k$ has the property that its values are scaled
within the range from zero to one, and that the larger its value, the better the fit of the
corresponding model. Since this is a relatively new statistic, its performance is not
well known and is thus included for investigation in the present study. Unfortunately,
BILOG (Mislevy & Bock, 1990) does not report $G^2$ for tests longer than 10 items;
hence the study of the performance of $m_k$ is confined to the test length = 5 situation
only.
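Both indices are simple functions of quantities already defined, as the following sketch shows. The function names and the numerical inputs are hypothetical and serve only to illustrate the definitions of AIC and $m_k$.

```python
import math

def aic(neg2lnl, n_params):
    """Akaike's information criterion: -2 ln L + 2m."""
    return neg2lnl + 2 * n_params

def mk_index(g2, df, n):
    """McDonald and Mok's index: exp(-d_k / 2) with d_k = (G2 - df) / N."""
    d_k = (g2 - df) / n
    return math.exp(-0.5 * d_k)

def select_by_aic(models):
    """Return the name of the model with the lowest AIC.

    models: dict mapping a model name to a (neg2lnl, n_params) tuple
    """
    return min(models, key=lambda name: aic(*models[name]))

# Hypothetical -2 ln L values for a 5-item test.
candidates = {"1-PL": (5000.0, 5), "2-PL": (4985.0, 10), "3-PL": (4984.0, 15)}
print(select_by_aic(candidates))
print(mk_index(g2=30.0, df=21, n=1000))
```

With these definitions a model whose $G^2$ equals its degrees of freedom has $d_k = 0$ and hence $m_k = 1$, the value indicating the best fit.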
III. Methodology

In this section, an overview of the research design is first described followed 
by a discussion of the specific details as well as some explanation of the rationale. 

A. Overview 

Two hundred replications of simulated test data for two test lengths (5 and 50
items) in combination with two sample sizes (N = 1000 and 2000) were generated
from an existing item bank according to the 1-PL, 2-PL and 3-PL models. The 
combination of 5 items and 1000 subjects has been used in a number of studies and 
serves as a base for comparison. With knowledge of the true model underlying each 
data set, other IRT models were fitted to the data and the corresponding difference 
likelihood ratio chi-square statistic was calculated. These difference chi-square 
statistics were then assessed to see if they were approximately distributed as a chi- 
square distribution. Also, the intercorrelations of the various chi-square statistics 
were computed. 

The likelihood ratio chi-square statistic is usually computed after the item
parameter estimates that maximize the likelihood function are chosen. But as
discussed earlier, the characteristics of the underlying $\theta$ distribution may affect the
estimation of the item parameters and may result in an incorrect estimation of the
likelihood ratio chi-square statistic. In order to minimize this problem, the marginal
maximum likelihood estimation procedure was used in this study to estimate the item
parameters, with the underlying $\theta$ distribution assumed to be
known for the 5-item test. The ability distribution for the 50-item test was
empirically estimated. All analyses were performed using the BILOG software.

B. The Simulation of Test Data

Test Length (5, 50): The items together with their item parameters used to generate
the two simulated tests in this study were selected from an existing Math Item Bank.
At present it contains about 220 test items and was constructed by one of the public
schools on the Eastern shore. First, 5 items were randomly selected to form the
5-item test. In practice, a 5-item test is too short to precisely measure an examinee's
ability. There were two reasons for constructing the 5-item test. The first reason was
that the expected frequency for each possible response pattern would probably be
larger than 5, thereby meeting the requirement of the chi-square test. In this case, the
total possible number of response patterns would amount to 32 ($= 2^5$). The expected
frequency for each pattern would then average 31.25 for a sample size of 1000 examinees.
Another reason was to follow the tradition of a long line of research, in which two
sections of the Law School Admission Test (LSAT), each with 5 items, were first studied
by Bock & Lieberman (1970) and later re-analyzed by Bock & Aitkin (1981),
Bartholomew (1980), Christoffersson (1975), McDonald & Mok (1995), and Muthén
(1978). Thus, a 5-item test was included in the present study to serve as a base
for comparison of the appropriateness of applying the likelihood ratio chi-square test
for model selection.
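The arithmetic behind the first reason can be sketched as below. The function name is illustrative; the calculation simply divides the sample size by the number of possible response patterns.

```python
def mean_expected_frequency(n_items, sample_size):
    """Average expected frequency per response pattern: N / 2**n."""
    return sample_size / 2 ** n_items

print(mean_expected_frequency(5, 1000))   # 31.25
print(mean_expected_frequency(5, 2000))   # 62.5
print(mean_expected_frequency(50, 2000))  # far below 5
```

The average says nothing about individual patterns, whose expected frequencies depend on the item parameters, but it shows why a 50-item test cannot satisfy the rule of thumb that expected frequencies be at least 5.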

Next, to produce the 50-item test, an additional 45 items were randomly
selected from the item bank and combined with the five previously selected items.
Tables III-1 and III-2 present some descriptive statistics of the item parameters used to
simulate the short and the longer tests. Notice that when test-score data were
simulated according to the one-PL model later on, the values of 1.0 and 0.0 were
assigned to all the discrimination indices $a_i$ and the guessing parameters $c_i$,
respectively. For the two-PL datasets, the value of 0.0 was used for all the guessing
parameters.

Table III-1

Descriptive Statistics of the Item Parameters for the 5-Item Test

                    a                      b                      c
Model       Mean   Range           Mean    Range           Mean   Range
One-PL      1.00   1.00 to 1.00    -0.70   -1.66 to 0.29   0.00   0.00 to 0.00
Two-PL      0.80   0.60 to 1.03    -0.70   -1.66 to 0.29   0.00   0.00 to 0.00
Three-PL    0.80   0.60 to 1.03    -0.70   -1.66 to 0.29   0.13   0.06 to 0.20

Table III-2

Descriptive Statistics of the Item Parameters for the 50-Item Test

                    a                      b                      c
Model       Mean   Range           Mean    Range           Mean   Range
One-PL      1.00   1.00 to 1.00    -0.40   -2.50 to 3.00   0.00   0.00 to 0.00
Two-PL      0.90   0.30 to 1.40    -0.40   -2.50 to 3.00   0.00   0.00 to 0.00
Three-PL    0.90   0.30 to 1.40    -0.40   -2.50 to 3.00   0.16   0.04 to 0.31



Ability and Sample Sizes (1000, 2000): In this study, the ability parameters were
randomly selected from the standard normal distribution, N(0,1). First, 1000 ability
parameters were selected and used for the N=1000 datasets. Then, for the N=2000 datasets,
an additional 1000 ability parameters were selected and combined with the
ability parameters of the N=1000 datasets. This way of constructing ability
parameters has previously been used by McKinley and Mills (1985). These ability
parameters were held constant across the 200 replications of data under each
combination of study conditions. The reason for this decision was to retain the same
metric for the estimated item parameters across the 200 replications for the test
length=5 situation. (But see the discussion in the Calibration and Analysis subsection
below for the test length=50 situation.) Furthermore, since the likelihood ratio
chi-square was calculated after the estimation of the item parameters, the above procedure
retained the same metric for the likelihood ratio chi-square statistic across the 200
replications within any specific study condition.

Simulation of Datasets: The probability of each examinee answering an item
correctly was computed according to Equation (1) or the like, depending on the
underlying IRT model. Uniform random numbers in the interval [0,1] were then
generated and compared with the examinees' probabilities of success. If the
probability was larger than the corresponding generated random number, the
examinee was scored 1; otherwise the examinee was scored 0. A total of twelve
combinations of conditions (two test lengths x two sample sizes x three IRT models)
were considered in this study. Two hundred replications were generated under each
condition.
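The generation procedure described above can be sketched as follows. The function name, the seeds, and the item parameter values are hypothetical; the abilities are drawn once from N(0,1) and then reused, mirroring the decision to hold them constant across replications.

```python
import math
import random

def simulate_responses(thetas, items, seed=0, D=1.702):
    """Simulate a 0/1 response matrix (examinees x items).

    thetas: list of ability values
    items:  list of (a, b, c) tuples; use c = 0 for two-PL data and
            a = 1, c = 0 for one-PL data
    """
    rng = random.Random(seed)
    data = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
            # Score 1 if the success probability exceeds a uniform random number.
            row.append(1 if p > rng.random() else 0)
        data.append(row)
    return data

# Abilities are drawn once and held constant across replications.
ability_rng = random.Random(12345)
thetas = [ability_rng.gauss(0.0, 1.0) for _ in range(1000)]

# Hypothetical three-PL parameters for a 5-item test.
items = [(0.60, -1.66, 0.06), (0.70, -1.20, 0.10), (0.80, -0.70, 0.13),
         (0.90, -0.20, 0.16), (1.03, 0.29, 0.20)]

# Each replication differs only in the uniform random numbers used for scoring.
replications = [simulate_responses(thetas, items, seed=r) for r in range(3)]
print(len(replications), len(replications[0]), len(replications[0][0]))
```

Only three replications are generated here to keep the example quick; the study used two hundred per condition.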
C. Calibration and Analysis

All likelihood ratio chi-square statistics were computed by using BILOG. Two 
options in BILOG deserve further explanation. 

The first one is the FREE option which, when adopted, instructs the program to
empirically estimate the $\theta$ distribution of the respondents. Otherwise, the default is to
assume the ability parameter to be distributed as a unit normal. In the present study,
this option was invoked for test length=50, but not for the test length=5 situation.
This was because for a short test, the empirical posterior distribution of the
ability parameter may not be accurate. Hence, the ability distribution was assigned
the default unit normal distribution, which is the same as the underlying distribution
of the simulated dataset. For a longer test, the empirical posterior can be quite
accurately estimated, and so the FREE option was adopted. Moreover, when FREE
was used for test length=50, owing to sampling fluctuation, the estimated posterior
ability distribution might differ slightly from replication to replication.
Consequently, the metric of the item parameters might also differ from
replication to replication. Strictly speaking, item linking should be performed.
However, such differences should be minimal, as each dataset was generated from the
identical set of ability parameters, which were held constant across replications within
each combination of study conditions. Finally, in a real-life situation, the ability
distribution is actually unknown; the FREE option was therefore used to estimate the latent
ability distribution.

The second one is the FLOAT option. If this option is adopted, the means of 
the item parameter prior distributions will be estimated along with the item 
parameters (see Mislevy & Bock, 1990). Otherwise, the means of the item parameters 
distribution will be fixed at their default values during the estimation process. In this 
study, FLOAT was invoked for test length=50, but not for test length=5. 

With the knowledge of the true model behind each dataset, three IRT models 
(1-PL, 2-PL, and 3-PL) were fitted to each of them and the corresponding likelihood 
ratio chi-square statistic assessed. Two hundred likelihood ratio chi-square statistics 
were then separately obtained for the replications within each combination of 
designed conditions. Afterward, the following analyses were conducted: 

(1) In order to determine whether the difference chi-square statistic was really
chi-square distributed, the distributions of the observed $G^2_{dif}$ statistics were examined
by comparing them to the theoretical central chi-square distributions with the
appropriate degrees of freedom. Following Holt & Macready (1989), each
distribution of 200 observed $G^2_{dif}$ statistics was classified into 10 intervals as defined
by the 0th, 10th, ..., 90th, and 100th quantiles of the corresponding central chi-square
distribution. Each of the intervals will, therefore, contain
an expected frequency count of 20. A Pearson chi-square statistic with 9 degrees of
freedom was computed to assess the fit of the observed $G^2_{dif}$ to a central chi-square
distribution.

(2) In addition to the overall fit, each of the observed $G^2_{dif}$ distributions was
examined by comparing its observed mean and standard deviation with its
corresponding expected mean and standard deviation.

(3) The intercorrelations among the likelihood ratio chi-square statistics computed
for each model as well as their relationship with the various difference chi-squares
computed from pairs of hierarchically related models were examined.

(4) The proportions of times the true models were correctly chosen over the
attempted models by the difference chi-square, the AIC, and the $m_k$ indices were also
computed for comparison purposes.
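The first analysis can be sketched as below. To keep the example free of external libraries, the chi-square deciles are obtained from the Wilson-Hilferty approximation instead of exact tabled quantiles; this substitution, the function names, and the input values are assumptions made for illustration only.

```python
from statistics import NormalDist

def chi2_quantile_wh(p, df):
    """Wilson-Hilferty approximation to the p-th quantile of a central
    chi-square distribution with df degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    t = 2.0 / (9.0 * df)
    return df * (1.0 - t + z * t ** 0.5) ** 3

def pearson_decile_fit(stats, df):
    """Classify the observed statistics into 10 intervals bounded by the
    deciles of the reference chi-square distribution and return the Pearson
    chi-square statistic (9 degrees of freedom)."""
    cuts = [chi2_quantile_wh(k / 10.0, df) for k in range(1, 10)]
    observed = [0] * 10
    for g in stats:
        observed[sum(g > cut for cut in cuts)] += 1   # index of the interval containing g
    expected = len(stats) / 10.0
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical case: all 200 difference statistics lie far above the reference deciles.
print(pearson_decile_fit([1000.0] * 200, df=50))
```

When all 200 statistics fall into a single interval, the Pearson statistic reaches its maximum of 9 x 20 + 180^2 / 20 = 1800, which is the value that appears in several cells of Table IV-2.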
IV. Results and Discussion

A. The Distribution of the Difference Likelihood Chi-square Test 

The observed distributions of $G^2_{dif}$ for each combination of study conditions
were first examined by comparing the observed means and standard deviations with
their expected values. If the difference statistics were really distributed as a
chi-square distribution, then for test length=50, the theoretical mean of the difference
chi-square between the 1-PL and the 2-PL models would be 50 and the corresponding
standard deviation would be 10 ($SD = \sqrt{2\,df} = \sqrt{100}$). The values for the difference
between the 1-PL and 3-PL, and between the 2-PL and 3-PL, were similarly
calculated. The results are listed in Table IV-1. As seen there, the observed means
and standard deviations were not close to the corresponding expected values. This was
especially the case when the sample size was 2000, and when simpler models
were fitted to datasets that were generated by more complicated models.



Table IV-1

Comparisons of the observed means and standard deviations of difference chi-square
statistics, $G^2_{dif}$, with their expected values when test length=50 (Sample Sizes = 1000,
2000; Replications = 200)

                                         True Model
Model                      One-PL              Two-PL              Three-PL
Comparison          N    1000     2000      1000     2000       1000     2000
1 vs 2    M  (50.00)    45.97    30.60    655.36  1260.59     597.28  1159.30
          SD (10.00)    18.67    25.86     52.52   105.15      45.25    65.73
1 vs 3    M  (100.0)    90.34   107.18    699.82  1342.49     706.14  1352.38
          SD (14.14)    19.81    34.57     50.99   103.95      50.71    72.93
2 vs 3    M  (50.00)    44.38    76.58     44.46    81.90     108.86   193.08
          SD (14.14)    20.47    37.23     18.14    26.93      21.12    27.95

* Expected statistics are given in parentheses.



The distributions of the difference chi-square were then examined according to
their overall fit to the corresponding central chi-square distributions, using the
Pearson chi-square test with 9 degrees of freedom. The
results are presented in Table IV-2 below. All the goodness-of-fit tests were
statistically significant at the .05 level, and hence none of the distributions of the
observed difference chi-square were distributed as a central chi-square distribution.
Similar to Table IV-1, Table IV-2 indicates that the goodness-of-fit was worse under a
larger sample size, and when the true model was more complicated than the attempted
model.

Table IV-2

The Pearson chi-square statistics for assessing the fit of the observed $G^2_{dif}$ statistics to
a central chi-square distribution when test length=50 (Sample Sizes = 1000, 2000;
Replications = 200)

                                   True Model
Model               One-PL              Two-PL              Three-PL
Comparison    N    1000     2000      1000     2000       1000     2000
1 vs 2           113.50   604.30   1800.00  1800.00    1800.00  1800.00
1 vs 3           148.20   179.10   1800.00  1800.00    1800.00  1800.00
2 vs 3           291.30   746.50    162.80   888.30    1663.70  1800.00



For test length=5, the observed and the expected means and standard
deviations of the difference chi-square between pairs of nested models are presented
in Table IV-3. The observed descriptive statistics were not close to the corresponding
expected values. When the 3-PL model was fitted to datasets generated from either
the 1-PL or the 2-PL model, the means of the difference chi-square statistics took on
negative values, which is rather unusual.



Table IV-3

Comparisons of the observed means and standard deviations of difference chi-square
statistics, $G^2_{dif}$, with their expected values when test length=5 (Sample Sizes = 1000,
2000; Replications = 200)

                                       True Model
Model                     One-PL             Two-PL             Three-PL
Comparison         N    1000    2000      1000    2000       1000    2000
1 vs 2    M  (5.00)     4.25    3.78     13.39   23.66       7.57   10.21
          SD (3.30)     3.42    2.90      7.29    9.45       4.65    6.36
1 vs 3    M  (10.0)    -1.71   -2.25      9.94   19.14      10.86   15.19
          SD (4.47)     6.07    5.83      7.97   10.30       6.25    8.97
2 vs 3    M  (5.00)    -5.96   -6.02     -3.45   -4.52       2.29    4.99
          SD (3.30)     4.77    5.34      4.41    5.06       4.05    6.26

* Expected statistics are given in parentheses.
The distributions of the difference chi-square were then examined in exactly
the same way as under test length=50. The goodness-of-fit statistics are presented in
Table IV-4. The goodness-of-fit tests were again statistically significant at the .05
level, and the observed difference chi-square statistics under test length=5
were not distributed as central chi-square distributions either.



Table IV-4

The Pearson chi-square statistics for assessing the fit of the observed $G^2_{dif}$ statistics to
a central chi-square distribution when test length=5 (Sample Sizes = 1000, 2000;
Replications = 200)

                                   True Model
Model               One-PL              Two-PL              Three-PL
Comparison    N    1000     2000      1000     2000       1000     2000
1 vs 2            45.30    46.60    765.80  1682.30     141.10   309.60
1 vs 3           336.30  1387.80     95.30   555.90      44.50   260.60
2 vs 3          1531.20  1495.00   1301.60  1370.30     137.60   152.80




One possible reason why the difference statistics were not distributed as
central chi-square distributions may parallel the one examined by Holt and Macready
(1989) in the context of latent class analysis. In both situations, the more parsimonious
models (e.g., 1-PL, 2-PL) were obtained from the subsuming model (the 3-PL in this study)
by constraining some parameters (here the guessing parameters) to their boundary values
(zero in this study). Hence, a regularity condition was violated and the difference
statistics may not have been chi-square distributed.
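
A toy simulation can make the boundary problem concrete. The sketch below is not the
IRT setting of this study; it assumes a much simpler case (the mean of unit-variance
normal data, restricted to be nonnegative and tested against zero), and the function
name and settings are illustrative only. In that case the likelihood ratio statistic is
exactly zero whenever the unrestricted estimate falls below the boundary, so it cannot
follow a central chi-square with one degree of freedom.

```python
import random


def boundary_lr_statistics(n_obs=50, n_reps=2000, seed=0):
    """Simulate LR statistics for H0: mu = 0 against mu >= 0 (toy normal model)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_reps):
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n_obs)) / n_obs
        mu_hat = max(xbar, 0.0)  # MLE under the boundary constraint mu >= 0
        # 2 * (loglik at mu_hat - loglik at 0) reduces to n * mu_hat^2 here
        stats.append(n_obs * mu_hat ** 2)
    return stats


stats = boundary_lr_statistics()
prop_zero = sum(s == 0.0 for s in stats) / len(stats)
mean_stat = sum(stats) / len(stats)
```

About half of the simulated statistics are exactly zero, and their mean is close to 0.5
rather than the value of 1 expected under a central chi-square with one degree of
freedom.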

As regards the anomaly of obtaining negative difference chi-squares under
marginal maximum likelihood estimation for test length=5, with the prior distribution of
θ fixed (see Table IV-3), the problem may be related to the fact that this test is
basically an easy test (see Table III-1). In this study, the θ distribution was fixed
at N(0,1), so relatively few low ability values were generated. Hence, the guessing
parameters were not appropriately estimated because few observations were available,
rendering their standard errors very large. Under these circumstances, Thissen & Wainer
(1982) indicated that "the large covariance between lower asymptote and location
(difficulty) then causes this uncertainty to move partially to the estimate of location.
With more difficult items the effect is lessened somewhat. The two-parameter model has
problems as well, but they are far less severe" (Thissen & Wainer, 1982, pp. 403-404).
For this reason, neither the difficulty nor the guessing parameters were accurately
estimated under the 3-PL. The final model may not "fit" better than the 1-PL and 2-PL
models, thereby producing negative difference chi-square statistics. This was especially
the case when the datasets were generated from the 1-PL model. The problem of negative
difference statistics did not occur in test length=50 because the item parameters were
quite accurately estimated. Also, for a longer test, the estimate of the marginal
probability of a response pattern is more reliable and accurate.

B. The Proportions of Times the True Models Were Correctly Chosen 

The proportions of times the true models were chosen over the attempted
models by the difference chi-square and AIC for test length=50 under various
situations are presented in Table IV-5. Selection of nested models by the m_k index is
not considered for the reason explained in section 2 above.

Although the difference chi-square statistics were not central chi-square
distributed, the proportions of times the true models were correctly chosen over the
attempted models were relatively high, ranging from .885 to 1.00 for sample size=1000.
However, when the sample size=2000 and the 3-PL was fitted to data generated from either
the 1-PL or the 2-PL, the proportions of times the true models were correctly chosen
could be quite low, ranging from .33 to .725. As seen from Table IV-5 regarding the
performance of AIC, the proportions of times the true models were correctly chosen were
quite high, especially when the true models were less complicated than the attempted
models. This is basically consistent with the knowledge that AIC penalizes the more
complicated model.

Table IV-5

The proportion of times the true models were chosen over the attempted models by
the difference chi-square and AIC for test length = 50 (Sample Sizes = 1000, 2000;
Replications = 200)

                                    True Model
Model               One-PL            Two-PL           Three-PL
Comparison   N     1000     2000     1000     2000     1000     2000

1 vs 2            0.885    0.915    1.000    1.000
1 vs 3            0.955    0.725                      1.000    1.000
2 vs 3                              0.900    0.325    0.965    1.000

AIC               0.995    0.990    0.990    0.725    0.675    1.000



The results for test length=5 are presented in Table IV-6. Here the difference
chi-squares were less likely to identify the true models, especially when the true
models were more complicated than the attempted models. The performance of AIC
was even worse when the underlying models were the more complicated models. Based
on Table IV-6, the m_k index appeared to prefer models in this order: 2-PL, 1-PL, then
3-PL, regardless of sample size.



Table IV-6

The proportion of times the true models were chosen over the attempted models by
the difference chi-square, AIC and m_k for test length = 5 (Sample Sizes = 1000, 2000;
Replications = 200)

                                    True Model
Model               One-PL            Two-PL           Three-PL
Comparison   N     1000     2000     1000     2000     1000     2000

1 vs 2            0.945    0.970    0.585    0.945
1 vs 3            1.000    1.000                      0.125    0.320
2 vs 3                              0.990    1.000    0.030    0.135

AIC               1.000    1.000    0.000    0.000    0.000    0.000

m_k               0.685    0.725    0.870    0.940    0.285    0.425
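
The difference-test and AIC rules compared in Tables IV-5 and IV-6 can be sketched for a
single pair of nested models as follows. The sketch assumes that AIC is computed as G²
plus twice the number of estimated item parameters (any constant common to both models
cancels) and that the difference test is carried out at the .05 level against a central
chi-square; the function names and the example G² values are hypothetical.

```python
# Upper 5% critical values of the central chi-square distribution for the
# degrees of freedom of the test length=5 comparisons (5 and 10).
CHI2_CRIT_05 = {5: 11.070, 10: 18.307}


def difference_test_prefers_larger(g2_small, g2_large, df_diff):
    """True if the difference chi-square rejects the smaller (nested) model."""
    return (g2_small - g2_large) > CHI2_CRIT_05[df_diff]


def aic_prefers_larger(g2_small, g2_large, df_diff):
    """True if the larger model has the lower AIC.

    With AIC = G2 + 2 * (number of parameters), the larger model wins exactly
    when the drop in G2 exceeds twice the number of added parameters.
    """
    return (g2_small - g2_large) > 2 * df_diff


# Hypothetical G2 values whose difference is 10.5 on 5 degrees of freedom.
by_test = difference_test_prefers_larger(60.5, 50.0, df_diff=5)
by_aic = aic_prefers_larger(60.5, 50.0, df_diff=5)
```

With five added parameters the AIC threshold (10) lies slightly below the .05 critical
value (11.07), whereas with ten added parameters it lies above it (20 versus 18.31), so
the two rules need not agree. A negative difference statistic leads both rules to retain
the smaller model.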



Basic descriptive statistics for the m_k index are presented in Table IV-7 below.
Most of the values were very close to the upper limit, and the spread of
values was extremely small.







Table IV-7

The descriptive statistics of Mean and Standard Deviation for m_k index
(Test Length = 5; Sample Sizes = 1000, 2000; Replications = 200)

                                               True Model
Attempted              One-PL                    Two-PL                   Three-PL
Model       N      1000        2000          1000        2000          1000        2000
                  M (SD)      M (SD)        M (SD)      M (SD)        M (SD)      M (SD)

One-PL         .999 (.004)  1.000 (.002)   .994 (.005)   .995 (.003)   .997 (.004)   .998 (.002)
Two-PL         .998 (.003)   .999 (.002)   .999 (.004)  1.000 (.002)   .998 (.004)   .999 (.002)
Three-PL       .993 (.003)   .997 (.002)   .994 (.004)   .997 (.002)   .997 (.003)   .999 (.001)



In addition, for test length=5, the likelihood ratio goodness-of-fit chi-square
statistic for testing the attempted model against a general multinomial alternative can
be computed according to Equation 9. The proportions of times the attempted models
were identified to fit the datasets by the likelihood ratio chi-square are presented in
Table IV-8. When the true model was the 1-PL, it can be computed from the table (as one
minus the diagonal entries) that the Type I error rates of the goodness-of-fit test in
identifying the true model amounted to 0.12 and 0.09 for sample sizes 1000 and 2000,
respectively. Likewise, when the true model was the 2-PL, the corresponding Type I error
rates were 0.11 and 0.10 for the two respective sample sizes. Lastly, when the true model
was the 3-PL, the corresponding Type I error rates jumped to 0.265 and 0.175,
respectively. In all cases, the probability of committing a Type I error by the
goodness-of-fit test exceeded the nominal alpha of 0.05. In addition, the values off the
diagonal in Table IV-8 denote the probabilities of committing a Type II error under
various situations. As can be seen there, the power of the likelihood ratio chi-square
test can be quite low.



Table IV-8

The proportions of times the attempted models were identified to fit the datasets by
the likelihood ratio chi-square goodness-of-fit test (Test Length = 5; Sample Sizes
= 1000, 2000; Replications = 200)

                                  True Model
Attempted         One-PL            Two-PL           Three-PL
Model      N     1000     2000     1000     2000     1000     2000

One-PL          0.880    0.910    0.620    0.290    0.750    0.645
Two-PL          0.865    0.855    0.890    0.900    0.845    0.845
Three-PL        0.300    0.340    0.455    0.480    0.735    0.825
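
The Type I error rates quoted in the text follow directly from the diagonal of Table
IV-8, as the short sketch below shows. The variable names are illustrative; the
proportions are copied from the table.

```python
# Proportions of replications in which the attempted model was judged to fit,
# copied from the diagonal of Table IV-8 (attempted model = true model).
fit_rate_true_model = {
    "One-PL": {1000: 0.880, 2000: 0.910},
    "Two-PL": {1000: 0.890, 2000: 0.900},
    "Three-PL": {1000: 0.735, 2000: 0.825},
}

NOMINAL_ALPHA = 0.05

# Type I error rate = 1 - proportion of times the true model was retained.
type_i_error = {
    model: {n: round(1.0 - rate, 3) for n, rate in by_n.items()}
    for model, by_n in fit_rate_true_model.items()
}

for model, by_n in type_i_error.items():
    for n, err in by_n.items():
        print(f"{model}, N={n}: Type I error = {err:.3f} (nominal {NOMINAL_ALPHA})")
```

Every rate computed this way exceeds the nominal level of 0.05.
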

C. The Intercorrelation Matrix among Chi-square Statistics

The intercorrelations among the various likelihood ratio chi-squares, as well as their
correlations with the difference chi-squares computed from pairs of nested models, when
the test length=50 are provided in Table IV-9. The numbers in parentheses are
correlations under sample size=2000, while those without parentheses are correlations
under sample size=1000.



Table IV-9 

The intercorrelations among the various likelihood ratio chi-squares as well as with 
the difference chi-squares computed from pairs of nested models when the test 
length=50 (Sample Sizes =1000, 2000; Replications =200) 



True Attempted Model 

Model One Two Three D12 D23 D13 




In the table, datasets generated by the 1-PL, 2-PL and 3-PL models are given the
numerical labels _1, _2, and _3, respectively. The verbal labels ONE, TWO, and THREE
indicate that the models attempted to fit the data were the 1-PL, 2-PL and 3-PL,
respectively. For example, the number .99 in the first row and second column of the
matrix represents a strong positive linear correlation between the likelihood chi-square
statistics produced by fitting the 1-PL to datasets generated by the 1-PL model and those
produced by fitting the 2-PL model to the same datasets. Likewise, the number .97 in the
second row and second column represents a strong positive linear correlation between the
likelihood chi-squares produced by fitting the 1-PL to datasets generated by the 2-PL
model and those produced by fitting the 2-PL model to the same datasets.

The label D12 represents the difference chi-square statistics produced by 
fitting the 1-PL and 2-PL to datasets generated by the same model. Consider, for 
example, the number .01 in the first row and fourth column of the matrix. Here the 
underlying true model is the 1-PL, so .01 denotes no linear correlation between the 
likelihood chi-squares produced by fitting the 1-PL with the difference chi-squares 
derived from fitting the 1-PL and 2-PL to the same datasets. 

Apparently, the upper left quadrant of the matrix indicates that, regardless of
what the true model was, the intercorrelations among the likelihood chi-squares
derived from various models were very high. The correlations between the likelihood
chi-squares and the various difference chi-squares in the upper right quadrant were
quite weak, indicating little linear association between them. One reason
behind this observation is that, using the 1-PL and 3-PL for illustration, the G² values
for the 3-PL models are larger than those computed for the 1-PL models for some
datasets, while smaller for the other datasets.

Finally, the intercorrelations among the difference chi-squares could be quite
high, which differs from the findings of Steiger et al. (1985). It should be pointed
out, however, that the theorems in Steiger et al. were stated in relation to noncentral
chi-square distributions.

The results for test length=5 are presented in Table IV-10 below. Basically,
the same patterns as in the previous table were found, except perhaps in the
upper right quadrant. There the correlations were moderately high in some instances.

Table IV-10

The intercorrelations among the various likelihood ratio chi-squares as well as with
the difference chi-squares computed from pairs of nested models when the test
length=5 (Sample Sizes = 1000, 2000; Replications = 200)

True   Attempted
Model  Model      One         Two         Three       D12         D23         D13

_1     One        1.0(1.0)    .90(.93)    .65(.73)    .55(.33)    .31(.16)    .55(.31)
_2                1.0(1.0)    .70(.61)    .63(.50)    .74(.82)    .02(.13)    .69(.81)
_3                1.0(1.0)    .84(.76)    .69(.42)    .54(.73)    .36(.42)    .63(.81)

_1     Two                    1.0(1.0)    .73(.77)    .12(-.04)   .33(.21)    .32(.17)
_2                            1.0(1.0)    .81(.73)    .05(.04)    .18(.35)    .14(.21)
_3                            1.0(1.0)    .83(.50)    -.01(.1?)   .41(.61)    .26(.49)

_1     Three                              1.0(1.0)    .06(.02)    -.40(-.47)  -.28(-.41)
_2                                        1.0(1.0)    .13(.10)    -.42(-.38)  -.12(-.09)
_3                                        1.0(1.0)    -.03(.11)   -.17(-.39)  -.13(-.20)

_1     D12                                            1.0(1.0)    .07(-.10)   .62(.41)
_2                                                    1.0(1.0)    -.14(-.09)  .84(.87)
_3                                                    1.0(1.0)    .02(.01)    .76(.72)

_1     D23                                                        1.0(1.0)    .83(.87)
_2                                                                1.0(1.0)    .43(.41)
_3                                                                1.0(1.0)    .67(.71)

_1     D13                                                                    1.0(1.0)
_2                                                                            1.0(1.0)
_3                                                                            1.0(1.0)

V. Conclusion

All in all, based on the above examination and discussion, it is clear that the
usual practice of treating the difference chi-square as distributed as a central
chi-square distribution is not sound. For short test length, the proportion of times the
correct model is selected can be very low. It appears that the difference chi-squares
are more likely to be distributed as a noncentral chi-square distribution. Hence a
natural extension of the present study is to estimate the noncentrality parameter
and then test whether the difference statistic is distributed as a noncentral chi-square
with appropriate degrees of freedom.
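
As a rough first step toward such an extension, the noncentrality parameter could be
estimated by the method of moments, since a noncentral chi-square with ν degrees of
freedom and noncentrality λ has mean ν + λ and variance 2(ν + 2λ). The sketch below is
only an illustration under that approach, not an analysis carried out in this study; the
function names are hypothetical, and the inputs are the 1-PL versus 2-PL entries of
Table IV-3 for data generated from the 2-PL.

```python
import math


def noncentrality_from_mean(observed_mean, df):
    """Method-of-moments estimate: mean = df + lambda, truncated at zero."""
    return max(observed_mean - df, 0.0)


def implied_sd(df, noncentrality):
    """SD of a noncentral chi-square: sqrt(2 * (df + 2 * lambda))."""
    return math.sqrt(2.0 * (df + 2.0 * noncentrality))


# (observed mean, observed SD) from Table IV-3, 1 vs 2, true model = Two-PL
observed = {1000: (13.39, 7.29), 2000: (23.66, 9.45)}
for n, (mean, sd) in observed.items():
    lam = noncentrality_from_mean(mean, df=5)
    print(f"N={n}: lambda_hat={lam:.2f}, implied SD={implied_sd(5, lam):.2f}, "
          f"observed SD={sd}")
```

For these two cells the implied standard deviations are about 6.6 and 9.2, compared with
the observed values of 7.29 and 9.45. A formal test of the noncentral hypothesis would
still be needed.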

So far as the performance of the selection indices in the context of IRT is
concerned, neither the AIC nor the m_k index is very satisfactory. A promising
index, namely, the root mean square error of approximation (RMSEA), has recently
drawn the attention of researchers in structural equation modeling (Steiger, 1980;
McDonald & Mok, 1995). This index has not been pursued in the present study because
some of the G² values were less than their corresponding degrees of freedom.
Apparently, more work needs to be done in the area of model selection
within an IRT context.
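
The difficulty just mentioned can be seen from one common sample form of the RMSEA,
sqrt(max(G² - df, 0) / (df (N - 1))). The sketch below assumes that form (some authors
divide by N rather than N - 1); the function name and the example values are
hypothetical and are not results from this study.

```python
import math


def rmsea(g2, df, n):
    """One common sample RMSEA: sqrt(max(g2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(g2 - df, 0.0) / (df * (n - 1)))


# Hypothetical values: the first statistic exceeds its degrees of freedom,
# the second falls below them, so its RMSEA is truncated to zero.
above = rmsea(g2=15.0, df=5, n=1001)
below = rmsea(g2=4.0, df=5, n=1001)
```

Whenever G² is smaller than its degrees of freedom, the index is truncated at zero and
can no longer distinguish between competing models.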